Sync NCAR/main branch of CCPP physics with ufs/dev branch #3190
Conversation
There are several time-outs in my Ursa RT run. I think this may be related to our resources being overused and fairshare reflecting that, throttling our runs.
I'll rerun the tests on Ursa as a secondary check. Also, a new GNU warning was introduced in this PR: can we get that fixed as part of this PR? Should just be an […]
Sure, I'll make that change and update the PR.
It looks like the timeouts on the HAFS tests are real, unfortunately. I repeated one of the HAFS tests to make sure, and it does seem to hang early in the run. Every other test passed. My first guess is that something in the CCPP updates is conflicting with the nests. This will need to be resolved before bringing in these changes.
@dpsarmie Thanks very much for rerunning those. I'll see if I can debug the HAFS tests that are timing out. I'm guessing that it's another issue with the new ccpp_bcast functionality, perhaps triggered by the nests.
@grantfirl Just wanted to confirm that this PR is ready for review? If so, I'll request UFSATM reviews.
No, there are still problems with the nested tests. I'll probably need help debugging these. I'm going to ask Dom, who made most of the changes related to mpi_bcast, to help.
@jkbk2004 @gspetro-NOAA Do you know of anyone who could help debug this nesting problem? I described what I'm seeing for @climbfuji here: ufs-community/ccpp-physics#369 (comment)
Perhaps @DusanJovic-NOAA would have some idea, since it's happening in UFSATM? That said, @NickSzapiro-NOAA suggested that you could potentially revert some of the CCPP changes in your PR (like NCAR/ccpp-physics#1187) to isolate where the failures might be coming from.
@DusanJovic-NOAA Do you know/remember what happens with the MPI root PE in a nested configuration? It seems as though there is a conflict with the MPI root PE when nesting is turned on. The MPI_bcast calls in this PR try to use the mpi_root value given to physics (which is set from FV3's mpp_root_pe() function) rather than a hard-coded 0, which is used in various places throughout the UFSATM code. I'm wondering if this is causing the conflict/hanging behavior noticed in this PR? Any insights are appreciated!
Root PE on the nested domain is the rank of the first MPI task that runs the nest. For example, if a parent domain runs on 80 tasks and the nest runs on 60 tasks, the first 80 tasks (ranks 0..79) will have root PE = 0, and the next 60 tasks (ranks 80..139) will have root PE = 80.
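For illustration, here is a minimal self-contained sketch of that failure mode (plain MPI, with the root logic only mimicking mpp_root_pe(); none of this is the actual UFSATM code):

```fortran
program mismatched_bcast
  use mpi
  implicit none
  integer :: rank, root, ierr
  real    :: buf(10)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = real(rank)
  ! Mimic mpp_root_pe(): 0 on the 80 parent tasks, 80 on the 60 nest tasks.
  root = merge(0, 80, rank < 80)
  ! All 140 tasks enter the same collective on one shared communicator,
  ! but with different roots: a mismatched MPI_Bcast, which is erroneous.
  call MPI_Bcast(buf, size(buf), MPI_REAL, root, MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)
end program mismatched_bcast
```

Run on 140 ranks, the parent tasks wait on root 0 while the nest tasks wait on root 80, and the program typically deadlocks, which matches the hangs seen in the HAFS tests.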
I think I see where the problem is. In atmos_model.F90 we currently have:
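```fortran
! (reconstructed from the discussion above, where Init_parm%master is
! described as being set from FV3's mpp_root_pe(); exact surrounding
! context in atmos_model.F90 may differ)
Init_parm%master = mpp_root_pe()
```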
I think the solution is either:
1) set Init_parm%master to 0, which matches the hard-coded root used elsewhere in UFSATM, or
2) give each domain its own communicator/root and have ccpp-physics use whatever communicator and root it is given.
I'm not sure about the rules for how CCPP treats multiple domains, or whether it lacks the concept of multiple domains entirely and treats all points uniformly, just as a list/array of columns. This hasn't been an issue so far, probably because ccpp-physics hard-codes '0' as the root of MPI collective calls. Basically, solution 1) reverts to that behavior.
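For reference, option 1) amounts to a one-line change to the assignment shown above:

```fortran
Init_parm%master = 0   ! always broadcast from rank 0, as ccpp-physics effectively did before
```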
@DusanJovic-NOAA I didn't follow this closely, but regarding the CCPP question: this depends on how the multiple domains are implemented. Is there a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)? Or are the grid points of the nested domain simply appended to those of the global domain?
@climbfuji I'm not sure I fully understand what you mean by 'a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)'. If I understand correctly how this works, CCPP runs over a list of columns on each MPI task, and each task runs its own copy/instance of CCPP. Does CCPP 'know' which (geographical) domain that list of points belongs to? I do not think it does; and if it does 'know', how does it know? The notion of domains, and the potential need for CCPP to 'know' this, is relevant if, for example, collective calls should only communicate with other (MPI) tasks belonging to the same domain, e.g. to compute a 'per domain' average, minimum, or maximum. I don't know if this is or ever will be needed in CCPP. In that case, domain communicators will be required, I think.
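If that ever becomes necessary, a minimal sketch of per-domain communicators could look like this (all names here are illustrative assumptions, not actual UFSATM code):

```fortran
! Split the world communicator by domain so collectives stay within
! one domain; rank 0 of each new communicator is its natural root.
integer :: domain_id            ! e.g. 0 = parent domain, 1 = nest
integer :: domain_comm, ierr
real    :: local_max, domain_max
call MPI_Comm_split(MPI_COMM_WORLD, domain_id, 0, domain_comm, ierr)
! A 'per domain' maximum then only involves that domain's tasks:
call MPI_Allreduce(local_max, domain_max, 1, MPI_REAL, MPI_MAX, domain_comm, ierr)
```

MPI_Comm_split renumbers ranks from 0 within each new communicator, so rank 0 is a well-defined per-domain root regardless of the tasks' global ranks.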
@DusanJovic-NOAA @climbfuji I also had the idea of setting Init_parm%master to 0 as a test, and I can confirm that it does indeed work (the model no longer hangs). I'm just not sure that it is the preferred solution. Dom, if a different host model handles MPI communicators and roots differently, as long as all ccpp_bcast calls (and any remaining straight mpi_bcast calls) are set up to use whatever communicator/root they're given, instead of hard-coded values, is that OK from your perspective?
Yes, absolutely. I am pretty sure that's how it works in NEPTUNE. We have a specific MPI communicator with a specific MPI root and CCPP should (and hopefully does) use those exclusively. Never MPI_COMM_WORLD or 0 by default.
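That contract could look something like the following sketch (the actual ccpp_bcast interface may differ; this just illustrates taking both communicator and root from the host):

```fortran
module bcast_sketch
  use mpi
  implicit none
contains
  ! Sketch: the root and communicator always come from the host model,
  ! never MPI_COMM_WORLD or a hard-coded rank 0.
  subroutine bcast_real_1d(buf, root, comm, ierr)
    real,    intent(inout) :: buf(:)
    integer, intent(in)    :: root, comm
    integer, intent(out)   :: ierr
    call MPI_Bcast(buf, size(buf), MPI_REAL, root, comm, ierr)
  end subroutine bcast_real_1d
end module bcast_sketch
```

Any host can then pass in its own communicator and root (per-domain or otherwise), and no MPI defaults are baked into the physics.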
@gspetro-NOAA Oops, so sorry. Should be fixed now.
@grantfirl On Hercules, I have 5 tests failing for the same reason with a TEST DID NOT COMPLETE status: […]
In the logs, the first error is: […] Follow-on errors look like: […] I am going to reclone and run just those 5 tests to see if something was wonky with the clone, because I've never seen an initial error like that before. Thought I'd post, though, in case it triggers your memory about anything.
@gspetro-NOAA Could it be a space issue?
@gspetro-NOAA I have never come across those kinds of errors in this PR or elsewhere. It sure seems machine-related and not code-related to me.
Happily, it looks like it was a random hiccup! All the failing Hercules tests passed on rerun. 🤷♀️
Testing has completed successfully on all platforms. Note that the failing CI is expected: repo_check is failing due to the merge of an LM4 PR outside of the PR process, and there is an increase of 1 remark on several tests due to a new "ifort: remark #10448", which will go away with the switch to LLVM.
Commit Queue Requirements:
test_changes.list indicates which tests, if any, are changed by this PR. Commit test_changes.list, even if it is empty.
Description:
This is primarily just a CCPP physics update, although it contains some host-side metadata changes to work with the ccpp-physics changes. See ufs-community/ccpp-physics#369 for a description of the CCPP-physics changes.
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Documentation:
ccpp_bcast capability.
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
This does allow CCPP to use the IP library rather than the SP library, which is a prerequisite for using spack-stack 2.0+.
Testing Log: